In this assessment, you need to answer all the questions about KNN, Linear Regression, Regularization, Logistic Regression, K-fold cross-validation, and other concepts covered in Modules 1-3. RStudio is recommended for completing your assessment. All code needs comments to help markers understand your ideas; if no comments are given, you may receive a 10% deduction on your mark. Please refer to the weekly activities for examples of how to write comments. After answering all the questions, please knit your R notebook file to HTML or PDF format. Submit both the .rmd file and the .html or .pdf file to the Assessment 1 dropbox via the link on the Assessment page. You can compress your files into a zip file for submission. The total mark for this assessment is 100, which is worth 30% of your final result.

# import libraries
library(ggplot2)
library(reshape)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:reshape':
## 
##     rename
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
library(grid)
library(gtable)
library(stringr)
library(caret)
## Loading required package: lattice
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(PerformanceAnalytics)
## Loading required package: xts
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## Attaching package: 'xts'
## The following objects are masked from 'package:dplyr':
## 
##     first, last
## 
## Attaching package: 'PerformanceAnalytics'
## The following object is masked from 'package:graphics':
## 
##     legend
library(AICcmodavg)
library(glmnet)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following object is masked from 'package:reshape':
## 
##     expand
## Loaded glmnet 4.1-2
library(corrplot)
## corrplot 0.90 loaded

Hint: Please review all reading materials in Modules 1-3 carefully, especially the activities.

Question 1 - KNN (20 marks)

In this question, you are required to implement a KNN classifier to predict the class of cancer clumps. The breast cancer dataset is used in this question. A detailed description of this data set can be found at https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Original%29.

Specifically, you need to:

  1. Split the data set into a training and a test set with the ratio of 7:3. (1 mark)
  2. Implement KNN classifiers with different K values (from 1 to 15). (5 marks)
  3. Investigate the impact of different K values (from 1 to 15) on the model performance (ACC) and the impacts of different distance measurements (Euclidean, Manhattan, Canberra, and Minkowski) on the model performance (ACC). Visualize and discuss your findings. (10 marks)
  4. Implement the Leave-one-out cross-validation for your KNN classifier with the best K value from the above steps. (4 marks)

Q1.0 - Data Cleansing

## 'data.frame':    699 obs. of  10 variables:
##  $ clumpThickness          : int  5 5 3 6 4 8 1 2 2 4 ...
##  $ sizeUniformity          : int  1 4 1 8 1 10 1 1 1 2 ...
##  $ shapeUniformity         : int  1 4 1 8 1 10 1 2 1 1 ...
##  $ marginalAdhesion        : int  1 5 1 1 3 8 1 1 1 1 ...
##  $ singleEpithelialCellSize: int  2 7 2 3 2 7 2 2 2 2 ...
##  $ bareNuclei              : int  1 10 2 4 1 10 10 1 1 1 ...
##  $ blandChromatin          : int  3 3 3 3 3 9 3 3 1 2 ...
##  $ normalNucleoli          : int  1 2 1 7 1 7 1 1 1 1 ...
##  $ mitosis                 : int  1 1 1 1 1 1 1 1 5 1 ...
##  $ class                   : Factor w/ 2 levels "benign","malignant": 1 1 1 1 1 2 1 1 1 1 ...
##           clumpThickness           sizeUniformity          shapeUniformity 
##                        0                        0                        0 
##         marginalAdhesion singleEpithelialCellSize               bareNuclei 
##                        0                        0                       16 
##           blandChromatin           normalNucleoli                  mitosis 
##                        0                        0                        0 
##                    class 
##                        0

We can see there are 16 missing (NA) values in the bareNuclei column. I will remove the rows containing them.

## [1] 683  10

Plotting the data, I will look for outliers.

I can see that all of the data lies within the expected range (1-10) given in the data description, so there is no need to remove any outliers. I can also see that the class column was converted correctly for all rows.

Q1.1: Split the data set into a training and a test set with the ratio of 7:3 (1 mark)

## [1] 478   9
## [1] 205   9
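The split itself is done in a hidden chunk; a minimal sketch of a 7:3 split follows, using a small synthetic frame as a stand-in for the cleaned breast cancer data (the names df, train.data, and test.data are assumptions for illustration):

```r
# synthetic stand-in for the cleaned breast cancer data frame
set.seed(42)                                   # reproducible split
df <- data.frame(x1 = runif(100), x2 = runif(100),
                 class = factor(sample(c("benign", "malignant"), 100, replace = TRUE)))

train_idx  <- sample(nrow(df), size = round(0.7 * nrow(df)))  # 70% of row indices
train.data <- df[train_idx, ]                  # training set (70%)
test.data  <- df[-train_idx, ]                 # test set (remaining 30%)

dim(train.data)
dim(test.data)
```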

Q1.2: Implement KNN classifiers with different K values (from 1 to 15) (5 marks)

Assessing KNN Classifier
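The classifiers themselves are fit in a hidden chunk; as a hedged sketch, a loop over K = 1 to 15 with class::knn() might look like the following (synthetic two-class data stands in for the real train/test split):

```r
library(class)   # provides the knn() classifier

# synthetic two-class data as a stand-in for the breast cancer split
set.seed(1)
n  <- 200
x  <- matrix(rnorm(2 * n), ncol = 2)
y  <- factor(ifelse(x[, 1] + x[, 2] > 0, "malignant", "benign"))
tr <- sample(n, 140)                       # 7:3 train/test split

# fit KNN for each K from 1 to 15 and record test-set accuracy
acc <- sapply(1:15, function(k) {
  pred <- knn(train = x[tr, ], test = x[-tr, ], cl = y[tr], k = k)
  mean(pred == y[-tr])
})
acc
```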

Q1.3: Investigate the impact of different K values (from 1 to 15) on the model performance (ACC) and the impacts of different distance measurements (Euclidean, Manhattan, Canberra, and Minkowski) on the model performance (ACC). Visualize and discuss your findings. (10 marks)

Discussion

A brief explanation of each of the metrics:

  • Euclidean: represents the shortest distance between two points
  • Manhattan: the sum of absolute differences between points across all the dimensions (ie. the cityblock method)
  • Minkowski: the generalized form of Euclidean and Manhattan Distance
  • Canberra: a weighted version of the Manhattan distance.

All four metrics give very similar classification results, with Canberra giving the highest accuracy (at K values of 8, 10, 11, 12, 13, and 14), Minkowski and Euclidean producing identical results, and Manhattan producing the lowest accuracy.

From my research, Minkowski is actually a generalised form of the Euclidean and Manhattan distances: it calculates a different distance depending on the ‘p’ value that is passed. If p=1 is passed, it returns the Manhattan distance; if p=2, it returns the Euclidean distance. From the documentation of the dist() function (https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/dist), we can see that p=2 is the default value used when p is not specified. This explains why the outputs of Minkowski and Euclidean are identical.
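This can be verified directly with dist() on a pair of toy points:

```r
# two points: (0, 0) and (3, 4)
pts <- matrix(c(0, 0, 3, 4), nrow = 2, byrow = TRUE)

dist(pts, method = "manhattan")            # |3| + |4| = 7
dist(pts, method = "minkowski", p = 1)     # same as Manhattan
dist(pts, method = "euclidean")            # sqrt(3^2 + 4^2) = 5
dist(pts, method = "minkowski")            # p defaults to 2, same as Euclidean
```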

I also plotted Minkowski with p=1, to compare the output to Manhattan.

## [[1]]
## [1] "Minkowski , p= 1"
## 
## [[2]]
## [1] 6
## 
## [[3]]
## [1] 97.56098

As we can see, it returns the same result as the Manhattan distance when passed p=1.
It is interesting that all of the other distances produced their maximum accuracy at multiple K values, except for Euclidean and Minkowski with p=2.

As for the Canberra distance, it is often used for data scattered around an origin, as it is biased towards measures around the origin and very sensitive to values close to zero. It can also handle several types of data (binary, categorical, quantitative, etc.). I would say that the Canberra distance is quite well suited to this dataset: there are multiple variables to learn from, and it is often used to identify anomalies in datasets.

It should also be noted that the Canberra distance produced the highest accuracy at a number of K values. Of these, I will go with the largest K, as a larger K gives lower model complexity and less sensitivity to noise. Therefore, I conclude that the Canberra distance produced the most accurate classifier, with K=14 as the best number of neighbours.

Q1.4: Implement the Leave-one-out cross-validation for your KNN classifier with the best K value from the above steps. (4 marks)

# using caret package
fit <- train(class~ .,
             method     = "knn",
             tuneGrid   = expand.grid(k = 14),
             trControl  =  trainControl(method  = "LOOCV"),
             metric     = "Accuracy",
             data       = df)
fit
## k-Nearest Neighbors 
## 
## 683 samples
##   9 predictor
##   2 classes: 'benign', 'malignant' 
## 
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation 
## Summary of sample sizes: 682, 682, 682, 682, 682, 682, ... 
## Resampling results:
## 
##   Accuracy  Kappa    
##   0.966325  0.9259104
## 
## Tuning parameter 'k' was held constant at a value of 14

The LOOCV method has produced an accuracy of \(96.6325\%\) when K = 14. This is lower than the accuracy obtained with the Canberra distance, but still within the ballpark, especially considering that the other distances all produced similar values. The Kappa value (Kappa = (observed accuracy - expected accuracy)/(1 - expected accuracy)) is also impressively high, indicating that the model's predictions are in almost perfect agreement with the actual classes.
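For comparison with the caret result, LOOCV can also be sketched by hand: hold out each observation once, train on all the others, and average the accuracy. This is only an illustrative sketch with synthetic stand-in data, using the K = 14 chosen above:

```r
library(class)   # provides knn()

# synthetic stand-in data for the cleaned breast cancer frame
set.seed(7)
n <- 120
x <- matrix(rnorm(2 * n), ncol = 2)
y <- factor(ifelse(x[, 1] > 0, "malignant", "benign"))

# leave each row out in turn and predict it from all the others
preds <- sapply(1:n, function(i) {
  as.character(knn(train = x[-i, , drop = FALSE],
                   test  = x[ i, , drop = FALSE],
                   cl = y[-i], k = 14))
})
loocv_acc <- mean(preds == y)   # LOOCV accuracy
loocv_acc
```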

Question 2 - Linear Regression (35 marks)

In this question, you need to implement a linear regression model to predict the residuary resistance of sailing yachts. The data set used in this question can be found in ‘yacht_hydrodynamics.csv’. The data set has 7 variables, which are summarized below.

Variations concern hull geometry coefficients and the Froude number:

  1. Longitudinal position of the center of buoyancy, adimensional.
  2. Prismatic coefficient, adimensional.
  3. Length-displacement ratio, adimensional.
  4. Beam-draught ratio, adimensional.
  5. Length-beam ratio, adimensional.
  6. Froude number, adimensional.

The measured variable is the residuary resistance per unit weight of displacement:

  1. Residuary resistance per unit weight of displacement, adimensional.

Specifically, you need to:

  1. Perform data pre-processing, including removing invalid data, transforming the categorical features to numerical features or other operations if necessary. (4 marks)
  2. Split the data set into a training set and a test set, with the ratio of 8:2. (1 mark)
  3. Implement stochastic gradient descent to train a linear regression model with your training data. Visualize the parameter updating process, test error (RMSE) in each iteration, and cost convergence process. Please be advised that built-in models in any released R package, like glm, are NOT allowed to use in this question. You can choose your preferred learning rate and determine the best iteration number. (8 marks)
  4. Evaluate your model by calculating the RMSE, and visualizing the residuals of test data. Please note that an explanation of your residual plot is needed. (5 marks)
  5. Does your model overfit? Which features do you think are not significant? Please justify your answers. For example, you can analyze the significance of a feature from correlation, variance, etc. (8 marks)
  6. Use the glmnet library to build two linear regression models with Lasso and Ridge regularization, respectively. In comparison to your model, how well do these two models perform? Do the regularized models automatically filter out the less significant features? What are the differences between these two models? Please justify your answers. (8 marks)
df2 = read.csv('yacht_hydrodynamics.csv')

Q2.1: Perform data pre-processing, including removing invalid data, transforming the categorical features to numerical features or other operations if necessary. (4 marks)

str(df2)
## 'data.frame':    308 obs. of  7 variables:
##  $ V1: num  -2.3 -2.3 -2.3 -2.3 -2.3 -2.3 -2.3 -2.3 -2.3 -2.3 ...
##  $ V2: num  0.568 0.568 0.568 0.568 0.568 0.568 0.568 0.568 0.568 0.568 ...
##  $ V3: num  4.78 4.78 4.78 4.78 4.78 4.78 4.78 4.78 4.78 4.78 ...
##  $ V4: num  3.99 3.99 3.99 3.99 3.99 3.99 3.99 3.99 3.99 3.99 ...
##  $ V5: num  3.17 3.17 3.17 3.17 3.17 3.17 3.17 3.17 3.17 3.17 ...
##  $ V6: num  0.125 0.15 0.175 0.2 0.225 0.25 0.275 0.3 0.325 0.35 ...
##  $ V7: num  0.11 0.27 0.47 0.78 1.18 1.82 2.61 3.76 4.99 7.16 ...
colSums(is.na(df2))
## V1 V2 V3 V4 V5 V6 V7 
##  0  0  0  0  4  0  0

There is a total of 308 rows in the dataset, and 4 rows in column V5 have missing values. The values have been read in as numerical data types, which I will leave as is. I will remove the rows with missing values, then plot the data in boxplots to look for significant outliers.

df2 <- na.omit(df2)
dim(df2)
## [1] 304   7

Plotting the data, I will look for anomalies.

# First check for duplicated data
sum(duplicated(df2))
## [1] 0
scatterplotMatrix(df2, regLine = list(col="blue"), smooth=list(col.smooth="orange", col.spread="orange"), col='black')

meltData <- melt(df2)
p <- ggplot(meltData, aes(factor(variable), value))
p + geom_boxplot() + facet_wrap(~variable, scale="free")

summary(df2)
##        V1               V2               V3              V4       
##  Min.   :-5.000   Min.   :0.5300   Min.   :4.340   Min.   :2.810  
##  1st Qu.:-2.400   1st Qu.:0.5460   1st Qu.:4.770   1st Qu.:3.750  
##  Median :-2.300   Median :0.5650   Median :4.780   Median :3.960  
##  Mean   :-2.381   Mean   :0.5641   Mean   :4.788   Mean   :3.937  
##  3rd Qu.:-2.300   3rd Qu.:0.5740   3rd Qu.:5.100   3rd Qu.:4.170  
##  Max.   : 0.000   Max.   :0.6000   Max.   :5.140   Max.   :5.350  
##        V5              V6               V7         
##  Min.   :2.730   Min.   :0.1250   Min.   : 0.0100  
##  1st Qu.:3.150   1st Qu.:0.2000   1st Qu.: 0.7675  
##  Median :3.150   Median :0.2875   Median : 3.0100  
##  Mean   :3.206   Mean   :0.2873   Mean   :10.5261  
##  3rd Qu.:3.510   3rd Qu.:0.3750   3rd Qu.:12.8150  
##  Max.   :3.640   Max.   :0.4500   Max.   :62.4200

There is no duplicated data, and there do not appear to be any significant outliers. Only the target column has some extreme values above 30, but these make up 44 rows (approx. 15% of the data), which I think is too large a share to remove.

Next I will standardise the values in preparation for linear regression. This will be particularly important when we apply Lasso and Ridge regularisation.

##        V1               V2               V3               V4        
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.5200   1st Qu.:0.2286   1st Qu.:0.5375   1st Qu.:0.3701  
##  Median :0.5400   Median :0.5000   Median :0.5500   Median :0.4528  
##  Mean   :0.5238   Mean   :0.4866   Mean   :0.5596   Mean   :0.4436  
##  3rd Qu.:0.5400   3rd Qu.:0.6286   3rd Qu.:0.9500   3rd Qu.:0.5354  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##        V5               V6               V7         
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:0.4615   1st Qu.:0.2308   1st Qu.:0.01214  
##  Median :0.4615   Median :0.5000   Median :0.04807  
##  Mean   :0.5228   Mean   :0.4995   Mean   :0.16850  
##  3rd Qu.:0.8571   3rd Qu.:0.7692   3rd Qu.:0.20518  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.00000
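The scaling itself was done in a hidden chunk; a minimal min-max sketch follows (the helper name min_max and the small stand-in frame are assumptions for illustration, using the observed ranges of V1 and V6):

```r
# map every column of a data frame onto the [0, 1] range
min_max <- function(col) (col - min(col)) / (max(col) - min(col))

# small stand-in for df2, built from the observed min/median/max of V1 and V6
toy <- data.frame(V1 = c(-5, -2.3, 0), V6 = c(0.125, 0.2875, 0.45))
toy_scaled <- as.data.frame(lapply(toy, min_max))
toy_scaled   # every column now runs from 0 to 1
```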

Looking further at the data, V7 appears exponential in shape, and V6 is very obviously not normally distributed. Given this, I don't think the relationships between the variables are linear in nature.

There are also some apparent correlations between the variables:

  • V7, V6 (strong positive)
  • V5, V4 (moderately negative)
  • V5, V3 (strong positive)
  • V4, V3 (moderately positive)
  • V4, V2 (moderately positive)

Q2.2: Split the data set into a training set and a test set, with the ratio of 8:2. (1 mark)

## [1] 243   6
## [1] 61  6

Q2.3: Implement stochastic gradient descent to train a linear regression model with your training data. Visualize the parameter updating process, test error (RMSE) in each iteration, and cost convergence process. Please be advised that built-in models in any released R package, like glm, are NOT allowed to use in this question. You can choose your preferred learning rate and determine the best iteration number. (8 marks)

Auxiliary Functions

Define an auxiliary function that calculates the prediction and cost based on the projected data and estimated coefficients.

# auxiliary function to calculate labels based on the estimated coefficients
predict_func <- function(Phi, w){
  # y(x,w) = Phi(x) * w
    return(Phi%*%w)
}

# The cost function for linear regression is the sum of squared error
error_func <- function (Phi, w, target){
    return(0.5 * sum((predict_func(Phi, w) - target)^2))
}

# RMSE function to calculate error 
rmse_func <- function (Phi, w, target){
  return(sqrt(mean((predict_func(Phi, w) - target)^2)))
}
SGD Function

Notes:

  • The loop will iterate until \(\tau\) reaches tau.max OR the cost (the sum of squared errors) falls below epsilon

  • For the first iteration, tau = 1 and W[tau,] holds the weights randomly assigned with runif in the initialisation step.

  • In each iteration, we shuffle the training data and hence Phi and T.

  • Each set of values for \(\pmb{w}^\tau\) is evaluated against the full set of \(\pmb{\phi}\) values hence error_func(Phi, W[tau,],T)

  • For each data point \(\pmb{\phi}_n\), we need to calculate \(\nabla E(\pmb{w}) = -(t_n - \pmb{w}^\tau \pmb{\phi}_n)\pmb{\phi}_n\) and then update the \(\pmb{w}^{\tau+1}\) by subtracting \(\eta * \nabla E(\pmb{w})\) from the current value of \(\pmb{w}^\tau\).

  • \(\pmb{w}^{\tau+1} := \pmb{w}^{\tau} - \eta \nabla E(\pmb{w}) = \pmb{w}^{\tau} + \eta(t_n - \pmb{w}^\tau \pmb{\phi}_n)\pmb{\phi}_n\) can be implemented as:
    \(W[(tau+1),j] := W[tau,j] + \eta \times (T[i]-t\_pred) \times Phi[i,j]\) where \(t\_pred\) is \(Phi[i,] %*% W[tau,]\)
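The actual training chunk is hidden; the update rule above can be sketched on synthetic data as follows (the names eta, tau.max, Phi, and the toy targets are illustrative assumptions, not the exact chunk used here):

```r
# synthetic regression data: a bias column plus one feature
set.seed(3)
n   <- 200
Phi <- cbind(1, runif(n))                     # design matrix with a bias column
w_true  <- c(0.5, 2)                          # coefficients used to generate toy targets
targets <- as.vector(Phi %*% w_true + rnorm(n, sd = 0.05))

eta     <- 0.01                               # learning rate
tau.max <- 5000                               # iteration budget
w <- runif(ncol(Phi))                         # random initial weights

for (tau in 1:tau.max) {
  i <- sample(n, 1)                           # pick a single random data point
  t_pred <- sum(Phi[i, ] * w)                 # current prediction for that point
  w <- w + eta * (targets[i] - t_pred) * Phi[i, ]   # w <- w - eta * grad E(w)
}
w   # should approach w_true
```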

## [1] "tau:  1000 error: 0.171746126754658"
## [1] "tau:  2000 error: 0.147349879977765"
## [1] "tau:  3000 error: 0.14337344532503"
## [1] "tau:  4000 error: 0.144140519480111"
## [1] "The final coefficents are: "
##          X0          X1          X2          X3          X4          X5 
## -0.19062143  0.04058752 -0.01763769 -0.05003362  0.06263530  0.06526828 
##          X6 
##  0.63503127
## [1] "The final training error is: 0.142716459101373"
## [1] "The final testing error is: 0.145290951967881"
# prepare error dataframe for visualization:
error.m <- melt(a$error, id='tau')
names(error.m)<-c('tau','dataset','RMSE')

# plot it
ggplot(data=error.m, aes(x=tau, y=RMSE, color=dataset)) +
    geom_line() + 
    ggtitle('Testing and Training RMSE Trends') +
    theme_minimal()

# convert W matrix to a dataframe and  melt for plotting
W.df <- as.data.frame(a$W); names(W.df) <- c('w0','w1','w2','w3','w4', 'w5', 'w6')
W.df$tau <- 1:nrow(W.df)

W.m <-melt(W.df, id='tau'); 
names(W.m) <- c('tau', 'coefficients', 'values')

ggplot(data=W.m, aes(x=tau, y=values, color=coefficients)) + geom_line() + ggtitle('Estimated Coefficients') + theme_minimal()

DISCUSSION

I have experimented with a number of variations of eta, tau.max, and the initial values of W. From what I have seen, the lowest RMSE values are produced when \(\eta = 0.01\). I also tried running the model for more iterations, but both the coefficients and the error become quite flat after the 3000-iteration mark; I extended to 5000 iterations to confirm the convergence of the cost.

Looking at the training vs. testing error graph, it is clear that the model performs quite similarly on both sets; indeed, the final error values are incredibly close. It is interesting to see a sharp drop around the 200th iteration, where the model starts to perform better and the error becomes quite small; from that point onwards through to the 5000th iteration there is only about 0.03 further reduction in error. The test and train error rates also do not diverge, so I believe the chosen learning rate is probably the best for this dataset.

Looking at the coefficient estimates over time, we can see that they were initially randomly generated, but all plateau around the 3000th iteration, although small jumps in the estimates remain. This is to be expected due to the stochastic nature of SGD; overall I think the model has learned well.

The initial value of W was another variable that I played around with. I tried \(W=0\), \(W=0.25\), \(W=0.5\), \(W=0.75\), \(W=1\), and \(W=random\). I found that there was little difference in the performance after 3000 iterations in all the models. Even though they jumped around quite a bit at the beginning, they all eventually converged around the 2000 iteration point, and flattened out further after 3000. This is probably due to the scaled and normalised nature of the data - the model doesn’t have too many positions to jump around and can find the optimal values quickly.

eta = 0.01, tau.max = 5000, W = random

  • “tau: 1000 error: 0.171746126754658”
  • “tau: 2000 error: 0.147349879977765”
  • “tau: 3000 error: 0.14337344532503”
  • “tau: 4000 error: 0.144140519480111”
  • “The final training error is: 0.142716459101373”
  • “The final testing error is: 0.145290951967881”

eta = 0.01, tau.max = 5000, W = 0

  • “tau: 1000 error: 0.161600195341595”
  • “tau: 2000 error: 0.14875552450342”
  • “tau: 3000 error: 0.14362057404834”
  • “tau: 4000 error: 0.142626338680638”
  • “The final training error is: 0.142306810595871”
  • “The final testing error is: 0.145231072727277”

eta = 0.01, tau.max = 5000, W = 0.25

  • “tau: 1000 error: 0.155304075556905”
  • “tau: 2000 error: 0.148003559594914”
  • “tau: 3000 error: 0.142617187364265”
  • “tau: 4000 error: 0.142999713900831”
  • “The final training error is: 0.142216415279971”
  • “The final testing error is: 0.144735190256315”

eta = 0.01, tau.max = 5000, W = 0.5

  • “tau: 1000 error: 0.156928582247015”
  • “tau: 2000 error: 0.14567736804452”
  • “tau: 3000 error: 0.142524500728116”
  • “tau: 4000 error: 0.144648457907774”
  • “The final training error is: 0.142604752076509”
  • “The final testing error is: 0.145306700306085”

eta = 0.01, tau.max = 5000, W = 0.75

  • “tau: 1000 error: 0.157478793743476”
  • “tau: 2000 error: 0.144311470323226”
  • “tau: 3000 error: 0.144696557749742”
  • “tau: 4000 error: 0.142365307325417”
  • “The final training error is: 0.143984204016896”
  • “The final testing error is: 0.146255628108942”

eta = 0.01, tau.max = 5000, W = 1

  • “tau: 1000 error: 0.163028465837712”
  • “tau: 2000 error: 0.148500835407268”
  • “tau: 3000 error: 0.143592957468202”
  • “tau: 4000 error: 0.144977071912628”
  • “The final training error is: 0.143166006987638”
  • “The final testing error is: 0.146781135499774”

The final test RMSE is \(\approx 0.1453\). I consider this fairly low given that the data is scaled from 0 to 1, so the model appears quite decent at predicting the values.

Q2.4: Evaluate your model by calculating the RMSE, and visualizing the residuals of test data. Please note that an explanation of your residual plot is needed. (5 marks)

print(paste('The final RMSE error is:', a$f_test_error))
## [1] "The final RMSE error is: 0.145290951967881"
# convert values to matrix
coeffs <- matrix()

for (i in 1:7) {
  coeffs <- cbind(coeffs, a$f_coeffs[[i]])
}

# remove the initial NA element created by matrix()
coeffs <- coeffs[-1]

coeffs
## [1] -0.19062143  0.04058752 -0.01763769 -0.05003362  0.06263530  0.06526828
## [7]  0.63503127
# Evaluate the model on unseen data: build a table of actual values,
# predictions, and residuals for each test row
evalModel <- function(learn_data, target_data, coeffs) {
  model_resids <- setNames(data.frame(matrix(ncol = 3)), c("Actual", "Predicted", "Residual"))
  
  size <- length(target_data)
  
  for (i in 1:size) {
    # prepend the bias term (v0 = 1) to the feature row
    row <- as.matrix(cbind('v0'=1, learn_data[i,]))
    # prediction = sum of element-wise products of features and coefficients
    yhat <- sum(data.frame(mapply(`*`, row, coeffs)))
    y <- target_data[i]
    resid <- y - yhat
    model_resids <- rbind(model_resids, c(y, yhat, resid))
  }
  # drop the all-NA first row created by matrix(ncol = 3)
  model_resids <- na.omit(model_resids)
  return(model_resids)
}

eval <- evalModel(test.data2, test.target2, matrix(coeffs))
eval
rss <- sum((eval$Predicted - eval$Actual) ^ 2)
tss <- sum((eval$Actual - mean(eval$Actual)) ^ 2)
rsq <- 1 - rss/tss
rsq
## [1] 0.6705301
print(paste("The model has an R-squared of", round(rsq*100, digits=2), "% on the test data"))
## [1] "The model has an R-squared of 67.05 % on the test data"
ggplot(eval, aes(x = Predicted, y = Residual)) +  # residuals vs. predicted values
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed")  # residuals should scatter evenly around zero

There is clearly a pattern (perhaps degree 3?) in the residuals, which leads me to think that this dataset is not well suited to linear regression. This is further supported by the R-squared on the test data (\(\approx 67\%\)), which indicates that the model is not a very good fit. This is not too surprising, since while exploring the data at the beginning I could already see signs that this dataset was not suited to a linear model, such as:

  • Non-linear associations in the data
  • Data not homoscedastic
  • Data has outliers

Q2.5: Does your model overfit? Which features do you think are not significant? Please justify your answers. For example, you can analyze the significance of a feature from correlation, variance, etc. (8 marks)

print(paste('The final training RMSE is:', round(a$f_train_error, digits=4)))
## [1] "The final training RMSE is: 0.1427"
print(paste('The final testing RMSE is:', round(a$f_test_error, digits=4)))
## [1] "The final testing RMSE is: 0.1453"
chart.Correlation(df2scaled, histogram=TRUE, pch=19)

Considering that the RMSE values of the two datasets are so close, I think the model does a good job of neither overfitting nor underfitting; it appears to be just right. However, the training error is slightly lower than the testing error, and this was the case in every run, which suggests a very small degree of overfitting. This could be due to the small size of the dataset.

Generally:

  • test RMSE > train RMSE => overfitting
  • test RMSE < train RMSE => underfitting

Looking back at the correlation graph, it is very clear that only V6 has any significant correlation with the target V7, and the shapes of the data in these two columns make this even clearer. From Wikipedia, “in naval architecture the Froude number is a significant figure used to determine the resistance of a partially submerged object moving through water”, so it makes sense that it would be a significant input in determining the resistance, but I am surprised that none of the other variables appear to have any significant influence.

The Froude number is a dimensionless quantity used to indicate the influence of gravity on fluid motion. From my research (https://www.mermaid-consultants.com/ship-resistance-calculation.html), ship resistance is normally a combination of four resistances:

  • Frictional resistance, due to friction layer formation between ship hull and viscous fluid
  • Wave making resistance, Energy transfer by the ship to create waves at free surface
  • Eddy resistance, Due to non-streamline flow at stern where flow separation happens
  • Air resistance, Windage area of hull and superstructure experience load by wind

Also, https://www.sciencedirect.com/topics/engineering/residuary-resistance states that it is usual to group wave-making resistance, form resistance, eddy resistance and frictional resistance into one force termed residuary resistance (i.e. V7).

Knowing this, it actually makes sense that none of the other features contribute significantly to the residuary resistance directly; they may instead contribute to the other resistance components that make up the residuary resistance.

To see which variables will be the most significant for V7, I will do a backward selection on all the variables to find which variables produce the model with the lowest AIC.

model <- lm(V7 ~ ., data=df2scaled)
smodel <- step(model)
## Start:  AIC=-1170.23
## V7 ~ V1 + V2 + V3 + V4 + V5 + V6
## 
##        Df Sum of Sq     RSS      AIC
## - V2    1    0.0002  6.1816 -1172.22
## - V3    1    0.0017  6.1831 -1172.14
## - V5    1    0.0019  6.1833 -1172.13
## - V4    1    0.0020  6.1834 -1172.13
## - V1    1    0.0078  6.1892 -1171.84
## <none>               6.1814 -1170.23
## - V6    1   11.8244 18.0058  -847.21
## 
## Step:  AIC=-1172.22
## V7 ~ V1 + V3 + V4 + V5 + V6
## 
##        Df Sum of Sq     RSS     AIC
## - V1    1    0.0079  6.1895 -1173.8
## - V3    1    0.0104  6.1920 -1173.7
## - V5    1    0.0108  6.1924 -1173.7
## - V4    1    0.0133  6.1949 -1173.6
## <none>               6.1816 -1172.2
## - V6    1   11.8243 18.0059  -849.2
## 
## Step:  AIC=-1173.83
## V7 ~ V3 + V4 + V5 + V6
## 
##        Df Sum of Sq     RSS      AIC
## - V3    1    0.0105  6.2001 -1175.31
## - V5    1    0.0109  6.2005 -1175.29
## - V4    1    0.0134  6.2030 -1175.17
## <none>               6.1895 -1173.83
## - V6    1   11.8294 18.0189  -850.98
## 
## Step:  AIC=-1175.31
## V7 ~ V4 + V5 + V6
## 
##        Df Sum of Sq     RSS      AIC
## - V5    1    0.0005  6.2005 -1177.29
## - V4    1    0.0031  6.2032 -1177.16
## <none>               6.2001 -1175.31
## - V6    1   11.8255 18.0256  -852.87
## 
## Step:  AIC=-1177.29
## V7 ~ V4 + V6
## 
##        Df Sum of Sq     RSS      AIC
## - V4    1    0.0026  6.2032 -1179.16
## <none>               6.2005 -1177.29
## - V6    1   11.8251 18.0256  -854.87
## 
## Step:  AIC=-1179.16
## V7 ~ V6
## 
##        Df Sum of Sq     RSS      AIC
## <none>               6.2032 -1179.16
## - V6    1    11.825 18.0277  -856.84
summary(model)
## 
## Call:
## lm(formula = V7 ~ ., data = df2scaled)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.18850 -0.12175 -0.03011  0.09891  0.50595 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.119401   0.090762  -1.316    0.189    
## V1           0.016848   0.027472   0.613    0.540    
## V2          -0.005155   0.049956  -0.103    0.918    
## V3           0.052837   0.183049   0.289    0.773    
## V4          -0.070592   0.226498  -0.312    0.756    
## V5          -0.063424   0.208857  -0.304    0.762    
## V6           0.633621   0.026583  23.836   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1443 on 297 degrees of freedom
## Multiple R-squared:  0.6571, Adjusted R-squared:  0.6502 
## F-statistic: 94.87 on 6 and 297 DF,  p-value: < 2.2e-16
anova(model)

It appears that the model with the lowest AIC (\(AIC=-1179.16\)) is V7 ~ V6, so the most significant variable in determining the residuary resistance per unit weight of displacement is the Froude number. However, this only holds for a linear model of degree 1.

It is possible that a higher-order model would be a better fit for the data points.

fmla1<- (as.formula(paste('V7 ~',paste('poly(',colnames(df2scaled[-7]),',1,raw=TRUE)',collapse = ' + '))))
deg1 <-lm(fmla1,data=df2scaled)
interaction <- aov(deg1, data = df2scaled)

print(paste("------------------------------DEGREE 1 SUMMARY------------------------------"))
## [1] "------------------------------DEGREE 1 SUMMARY------------------------------"
summary(deg1)
## 
## Call:
## lm(formula = fmla1, data = df2scaled)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.18850 -0.12175 -0.03011  0.09891  0.50595 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             -0.119401   0.090762  -1.316    0.189    
## poly(V1, 1, raw = TRUE)  0.016848   0.027472   0.613    0.540    
## poly(V2, 1, raw = TRUE) -0.005155   0.049956  -0.103    0.918    
## poly(V3, 1, raw = TRUE)  0.052837   0.183049   0.289    0.773    
## poly(V4, 1, raw = TRUE) -0.070592   0.226498  -0.312    0.756    
## poly(V5, 1, raw = TRUE) -0.063424   0.208857  -0.304    0.762    
## poly(V6, 1, raw = TRUE)  0.633621   0.026583  23.836   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1443 on 297 degrees of freedom
## Multiple R-squared:  0.6571, Adjusted R-squared:  0.6502 
## F-statistic: 94.87 on 6 and 297 DF,  p-value: < 2.2e-16
print(paste("------------------------------ANOVA 1 SUMMARY------------------------------"))
## [1] "------------------------------ANOVA 1 SUMMARY------------------------------"
summary(interaction)
##                          Df Sum Sq Mean Sq F value Pr(>F)    
## poly(V1, 1, raw = TRUE)   1  0.013   0.013   0.630  0.428    
## poly(V2, 1, raw = TRUE)   1  0.008   0.008   0.363  0.547    
## poly(V3, 1, raw = TRUE)   1  0.000   0.000   0.002  0.966    
## poly(V4, 1, raw = TRUE)   1  0.000   0.000   0.013  0.910    
## poly(V5, 1, raw = TRUE)   1  0.001   0.001   0.049  0.826    
## poly(V6, 1, raw = TRUE)   1 11.824  11.824 568.134 <2e-16 ***
## Residuals               297  6.181   0.021                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
cat("\n\n")
fmla2<- (as.formula(paste('V7 ~',paste('poly(',colnames(df2scaled[-7]),',2,raw=TRUE)',collapse = ' + '))))
deg2 <-lm(fmla2,data=df2scaled)
interaction <- aov(deg2, data = df2scaled)

print(paste("------------------------------DEGREE 2 SUMMARY------------------------------"))
## [1] "------------------------------DEGREE 2 SUMMARY------------------------------"
summary(deg2)
## 
## Call:
## lm(formula = fmla2, data = df2scaled)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.116579 -0.060889  0.002038  0.048669  0.272391 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               0.14652    0.14891   0.984    0.326    
## poly(V1, 2, raw = TRUE)1 -0.03227    0.04318  -0.747    0.455    
## poly(V1, 2, raw = TRUE)2  0.04680    0.04182   1.119    0.264    
## poly(V2, 2, raw = TRUE)1 -0.02234    0.09143  -0.244    0.807    
## poly(V2, 2, raw = TRUE)2  0.02351    0.03691   0.637    0.525    
## poly(V3, 2, raw = TRUE)1  0.09769    0.33013   0.296    0.767    
## poly(V3, 2, raw = TRUE)2 -0.01094    0.05441  -0.201    0.841    
## poly(V4, 2, raw = TRUE)1 -0.17161    0.48123  -0.357    0.722    
## poly(V4, 2, raw = TRUE)2  0.06233    0.14156   0.440    0.660    
## poly(V5, 2, raw = TRUE)1 -0.11409    0.36243  -0.315    0.753    
## poly(V5, 2, raw = TRUE)2  0.01287    0.06443   0.200    0.842    
## poly(V6, 2, raw = TRUE)1 -0.84523    0.04662 -18.132   <2e-16 ***
## poly(V6, 2, raw = TRUE)2  1.48002    0.04498  32.903   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06702 on 291 degrees of freedom
## Multiple R-squared:  0.9275, Adjusted R-squared:  0.9245 
## F-statistic: 310.2 on 12 and 291 DF,  p-value: < 2.2e-16
print(paste("------------------------------ANOVA 2 SUMMARY------------------------------"))
## [1] "------------------------------ANOVA 2 SUMMARY------------------------------"
summary(interaction)
##                          Df Sum Sq Mean Sq  F value Pr(>F)    
## poly(V1, 2, raw = TRUE)   2  0.018   0.009    2.039  0.132    
## poly(V2, 2, raw = TRUE)   2  0.010   0.005    1.094  0.336    
## poly(V3, 2, raw = TRUE)   2  0.000   0.000    0.028  0.973    
## poly(V4, 2, raw = TRUE)   2  0.001   0.000    0.056  0.945    
## poly(V5, 2, raw = TRUE)   2  0.000   0.000    0.032  0.969    
## poly(V6, 2, raw = TRUE)   2 16.691   8.346 1857.941 <2e-16 ***
## Residuals               291  1.307   0.004                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
cat("\n\n")
fmla3 <- (as.formula(paste('V7 ~',paste('poly(',colnames(df2scaled[-7]),',3,raw=TRUE)',collapse = ' + '))))
deg3 <-lm(fmla3,data=df2scaled)
interaction <- aov(deg3, data = df2scaled)

print(paste("------------------------------DEGREE 3 SUMMARY------------------------------"))
## [1] "------------------------------DEGREE 3 SUMMARY------------------------------"
summary(deg3)
## 
## Call:
## lm(formula = fmla3, data = df2scaled)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.084031 -0.019138 -0.002433  0.016782  0.171410 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               0.13444    0.20222   0.665    0.507    
## poly(V1, 3, raw = TRUE)1  3.29372    4.29317   0.767    0.444    
## poly(V1, 3, raw = TRUE)2 -9.40798   12.21724  -0.770    0.442    
## poly(V1, 3, raw = TRUE)3  6.12993    7.92416   0.774    0.440    
## poly(V2, 3, raw = TRUE)1  0.19990    0.44713   0.447    0.655    
## poly(V2, 3, raw = TRUE)2 -0.60916    1.20945  -0.504    0.615    
## poly(V2, 3, raw = TRUE)3  0.41382    0.79384   0.521    0.603    
## poly(V3, 3, raw = TRUE)1 -0.07407    0.66427  -0.112    0.911    
## poly(V3, 3, raw = TRUE)2  1.05106    1.65646   0.635    0.526    
## poly(V3, 3, raw = TRUE)3 -0.77163    1.10698  -0.697    0.486    
## poly(V4, 3, raw = TRUE)1 -0.75497    0.99425  -0.759    0.448    
## poly(V4, 3, raw = TRUE)2  1.20968    1.53862   0.786    0.432    
## poly(V4, 3, raw = TRUE)3 -0.75041    0.95772  -0.784    0.434    
## poly(V5, 3, raw = TRUE)1 -0.42678    0.84076  -0.508    0.612    
## poly(V5, 3, raw = TRUE)2  0.33463    1.29710   0.258    0.797    
## poly(V5, 3, raw = TRUE)3 -0.15387    0.80055  -0.192    0.848    
## poly(V6, 3, raw = TRUE)1  0.64202    0.04863  13.202   <2e-16 ***
## poly(V6, 3, raw = TRUE)2 -2.37788    0.11558 -20.574   <2e-16 ***
## poly(V6, 3, raw = TRUE)3  2.57034    0.07581  33.904   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.03018 on 285 degrees of freedom
## Multiple R-squared:  0.9856, Adjusted R-squared:  0.9847 
## F-statistic:  1084 on 18 and 285 DF,  p-value: < 2.2e-16
print(paste("------------------------------ANOVA 3 SUMMARY------------------------------"))
## [1] "------------------------------ANOVA 3 SUMMARY------------------------------"
summary(interaction)
##                          Df Sum Sq Mean Sq  F value   Pr(>F)    
## poly(V1, 3, raw = TRUE)   3  0.019   0.006    6.873 0.000175 ***
## poly(V2, 3, raw = TRUE)   3  0.010   0.003    3.511 0.015742 *  
## poly(V3, 3, raw = TRUE)   3  0.000   0.000    0.063 0.979446    
## poly(V4, 3, raw = TRUE)   3  0.001   0.000    0.390 0.760197    
## poly(V5, 3, raw = TRUE)   3  0.001   0.000    0.406 0.748735    
## poly(V6, 3, raw = TRUE)   3 17.738   5.913 6493.082  < 2e-16 ***
## Residuals               285  0.260   0.001                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
cat("\n\n")
fmla4 <- (as.formula(paste('V7 ~',paste('poly(',colnames(df2scaled[-7]),',4,raw=TRUE)',collapse = ' + '))))
deg4 <-lm(fmla4,data=df2scaled)
interaction <- aov(deg4, data = df2scaled)

print(paste("------------------------------DEGREE 4 SUMMARY------------------------------"))
## [1] "------------------------------DEGREE 4 SUMMARY------------------------------"
summary(deg4)
## 
## Call:
## lm(formula = fmla4, data = df2scaled)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.108116 -0.012216  0.000432  0.009588  0.144277 
## 
## Coefficients: (2 not defined because of singularities)
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 0.20373    1.31279   0.155   0.8768    
## poly(V1, 4, raw = TRUE)1  -64.51317  922.16473  -0.070   0.9443    
## poly(V1, 4, raw = TRUE)2  320.83029 4410.68173   0.073   0.9421    
## poly(V1, 4, raw = TRUE)3 -509.66455 6785.77009  -0.075   0.9402    
## poly(V1, 4, raw = TRUE)4  253.36395 3297.25655   0.077   0.9388    
## poly(V2, 4, raw = TRUE)1    1.03372   10.41229   0.099   0.9210    
## poly(V2, 4, raw = TRUE)2   -3.70244   42.83247  -0.086   0.9312    
## poly(V2, 4, raw = TRUE)3    3.96886   59.83807   0.066   0.9472    
## poly(V2, 4, raw = TRUE)4   -1.28301   26.94680  -0.048   0.9621    
## poly(V3, 4, raw = TRUE)1   -0.41672    1.36238  -0.306   0.7599    
## poly(V3, 4, raw = TRUE)2    3.38742    7.39639   0.458   0.6473    
## poly(V3, 4, raw = TRUE)3   -4.61583    9.23514  -0.500   0.6176    
## poly(V3, 4, raw = TRUE)4    1.94572    5.67872   0.343   0.7321    
## poly(V4, 4, raw = TRUE)1   -1.37201    4.70745  -0.291   0.7709    
## poly(V4, 4, raw = TRUE)2    3.55233    3.97711   0.893   0.3725    
## poly(V4, 4, raw = TRUE)3   -4.27372    4.16857  -1.025   0.3061    
## poly(V4, 4, raw = TRUE)4    1.68621    2.30955   0.730   0.4659    
## poly(V5, 4, raw = TRUE)1   -0.49561    3.20825  -0.154   0.8773    
## poly(V5, 4, raw = TRUE)2    0.18513    0.60240   0.307   0.7588    
## poly(V5, 4, raw = TRUE)3         NA         NA      NA       NA    
## poly(V5, 4, raw = TRUE)4         NA         NA      NA       NA    
## poly(V6, 4, raw = TRUE)1   -0.12669    0.07097  -1.785   0.0753 .  
## poly(V6, 4, raw = TRUE)2    1.39366    0.30599   4.555 7.83e-06 ***
## poly(V6, 4, raw = TRUE)3   -3.43925    0.46881  -7.336 2.36e-12 ***
## poly(V6, 4, raw = TRUE)4    3.00672    0.23256  12.929  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02398 on 281 degrees of freedom
## Multiple R-squared:  0.991,  Adjusted R-squared:  0.9903 
## F-statistic:  1412 on 22 and 281 DF,  p-value: < 2.2e-16
print(paste("------------------------------ANOVA 4 SUMMARY------------------------------"))
## [1] "------------------------------ANOVA 4 SUMMARY------------------------------"
summary(interaction)
##                          Df Sum Sq Mean Sq  F value   Pr(>F)    
## poly(V1, 4, raw = TRUE)   4  0.019   0.005    8.212 2.82e-06 ***
## poly(V2, 4, raw = TRUE)   4  0.010   0.002    4.162  0.00273 ** 
## poly(V3, 4, raw = TRUE)   4  0.002   0.000    0.714  0.58311    
## poly(V4, 4, raw = TRUE)   4  0.004   0.001    1.552  0.18746    
## poly(V5, 4, raw = TRUE)   2  0.002   0.001    1.915  0.14931    
## poly(V6, 4, raw = TRUE)   4 17.830   4.458 7752.393  < 2e-16 ***
## Residuals               281  0.162   0.001                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
cat("\n\n")
fmla5 <- (as.formula(paste('V7 ~',paste('poly(',colnames(df2scaled[-7]),',5,raw=TRUE)',collapse = ' + '))))
deg5 <-lm(fmla5,data=df2scaled)
interaction <- aov(deg5, data = df2scaled)

print(paste("------------------------------DEGREE 5 SUMMARY------------------------------"))
## [1] "------------------------------DEGREE 5 SUMMARY------------------------------"
summary(deg5)
## 
## Call:
## lm(formula = fmla5, data = df2scaled)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.108140 -0.012221  0.000456  0.009573  0.144253 
## 
## Coefficients: (7 not defined because of singularities)
##                            Estimate Std. Error t value Pr(>|t|)  
## (Intercept)              -1.983e-01  9.268e-01  -0.214   0.8308  
## poly(V1, 5, raw = TRUE)1  6.535e+04  2.798e+05   0.234   0.8155  
## poly(V1, 5, raw = TRUE)2 -3.120e+05  1.336e+06  -0.234   0.8155  
## poly(V1, 5, raw = TRUE)3  4.795e+05  2.053e+06   0.234   0.8155  
## poly(V1, 5, raw = TRUE)4 -2.328e+05  9.967e+05  -0.234   0.8155  
## poly(V1, 5, raw = TRUE)5         NA         NA      NA       NA  
## poly(V2, 5, raw = TRUE)1 -1.221e+03  5.225e+03  -0.234   0.8154  
## poly(V2, 5, raw = TRUE)2  7.565e+03  3.236e+04   0.234   0.8153  
## poly(V2, 5, raw = TRUE)3 -1.713e+04  7.326e+04  -0.234   0.8153  
## poly(V2, 5, raw = TRUE)4  1.676e+04  7.165e+04   0.234   0.8152  
## poly(V2, 5, raw = TRUE)5 -5.972e+03  2.553e+04  -0.234   0.8152  
## poly(V3, 5, raw = TRUE)1  1.453e+02  6.147e+02   0.236   0.8133  
## poly(V3, 5, raw = TRUE)2 -9.212e+02  3.888e+03  -0.237   0.8129  
## poly(V3, 5, raw = TRUE)3  2.082e+03  8.769e+03   0.237   0.8125  
## poly(V3, 5, raw = TRUE)4 -1.985e+03  8.338e+03  -0.238   0.8120  
## poly(V3, 5, raw = TRUE)5  6.780e+02  2.841e+03   0.239   0.8116  
## poly(V4, 5, raw = TRUE)1  9.005e+00  3.973e+01   0.227   0.8209  
## poly(V4, 5, raw = TRUE)2 -5.504e+01  2.417e+02  -0.228   0.8200  
## poly(V4, 5, raw = TRUE)3  1.049e+02  4.608e+02   0.228   0.8200  
## poly(V4, 5, raw = TRUE)4 -5.916e+01  2.599e+02  -0.228   0.8201  
## poly(V4, 5, raw = TRUE)5         NA         NA      NA       NA  
## poly(V5, 5, raw = TRUE)1         NA         NA      NA       NA  
## poly(V5, 5, raw = TRUE)2         NA         NA      NA       NA  
## poly(V5, 5, raw = TRUE)3         NA         NA      NA       NA  
## poly(V5, 5, raw = TRUE)4         NA         NA      NA       NA  
## poly(V5, 5, raw = TRUE)5         NA         NA      NA       NA  
## poly(V6, 5, raw = TRUE)1 -1.253e-01  1.144e-01  -1.095   0.2743  
## poly(V6, 5, raw = TRUE)2  1.383e+00  7.815e-01   1.769   0.0779 .
## poly(V6, 5, raw = TRUE)3 -3.409e+00  2.064e+00  -1.651   0.0998 .
## poly(V6, 5, raw = TRUE)4  2.972e+00  2.308e+00   1.287   0.1991  
## poly(V6, 5, raw = TRUE)5  1.406e-02  9.192e-01   0.015   0.9878  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02402 on 280 degrees of freedom
## Multiple R-squared:  0.991,  Adjusted R-squared:  0.9903 
## F-statistic:  1346 on 23 and 280 DF,  p-value: < 2.2e-16
print(paste("------------------------------ANOVA 5 SUMMARY------------------------------"))
## [1] "------------------------------ANOVA 5 SUMMARY------------------------------"
summary(interaction)
##                          Df Sum Sq Mean Sq  F value   Pr(>F)    
## poly(V1, 5, raw = TRUE)   4  0.019   0.005    8.183 2.97e-06 ***
## poly(V2, 5, raw = TRUE)   5  0.010   0.002    3.325   0.0062 ** 
## poly(V3, 5, raw = TRUE)   5  0.002   0.000    0.806   0.5459    
## poly(V4, 5, raw = TRUE)   4  0.005   0.001    2.194   0.0699 .  
## poly(V6, 5, raw = TRUE)   5 17.830   3.566 6179.849  < 2e-16 ***
## Residuals               280  0.162   0.001                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
cat("\n\n")

From the analysis of the above models' RMSE, R-squared, MSE, sum of squares, significance of the F and t values, and F-statistic, we can see that a polynomial of degree 4 is probably best suited to our dataset.

Degree 1

  • Residual standard error (RSE): 0.1443

  • R-squared: 0.6571

  • F-statistic: 94.87

  • Sum of squares (SS): 6.181

  • MSE: 0.021

    Starting from Degree 1 (linear), we can see that the only variable with real significance to V7 is V6. Although these error rates initially appear low, comparing them against the other degree summaries makes it clear that this model fits the data poorly.

Degree 2

  • Residual standard error (RSE): 0.06702

  • R-squared: 0.9275

  • F-statistic: 310.2

  • Sum of squares (SS): 1.307

  • MSE: 0.004

    There is a significant jump in the F-statistic (indicating that the variables are more likely to contribute to the output, rather than chance) and in R-squared, while RSE, SS and MSE have all decreased substantially. This indicates that this model is a significantly better fit for our data; however, V6 is still the only significant variable.

Degree 3

  • Residual standard error (RSE): 0.03018

  • R-squared: 0.9856

  • F-statistic: 1084

  • Sum of squares (SS): 0.260

  • MSE: 0.001

    Again, a very large jump in the F-statistic indicates that the variables become more significant as the model increases in order. R-squared has also improved considerably and is very close to perfect. The RSE, SS and MSE have all decreased, indicating that this model is again an even better fit than the last. Although V6 is still the most significant variable, V1 is starting to show through as having some significance, as is V2.

Degree 4

  • Residual standard error (RSE): 0.02398

  • R-squared: 0.991

  • F-statistic: 1412

  • Sum of squares (SS): 0.162

  • MSE: 0.001

    The R-squared and F-statistic values have increased again, and the RSE and SS have decreased as well. The MSE has not changed, but this is probably because it is already so small. The R-squared value is close to perfect; however, in the per-variable analysis we can see that the \(V5^3\) and \(V5^4\) terms are dropped as collinear (they appear as NA in the table). V6 is still the most significant variable, with V2 also appearing more significant, and V1 rising in significance too.

Degree 5

  • Residual standard error (RSE): 0.02402

  • R-squared: 0.991

  • F-statistic: 1346

  • Sum of squares (SS): 0.162

  • MSE: 0.001

    We can see that both the RSE and the F-statistic have started to move in the wrong direction in this model, and that R-squared, SS and MSE are unchanged. Looking at the significance of the variables, we can also see that many terms no longer contribute to the model (e.g. \(V1^5\), \(V4^5\), and all of V5, which are dropped as collinear). So although this model appears quite accurate, it performs worse than Degree 4: the extra terms add nothing but redundancy, and the error has started to increase.

    We can say that for our dataset, a model of degree 4 would be a better fit, and this can be further supported by analysing the AIC values for each of the models.

model.set <- list(deg1, deg2, deg3, deg4, deg5)
model.names <- c("deg1", "deg2", "deg3", "deg4", "deg5")

# aictab() is from the AICcmodavg package: library(AICcmodavg)
aictab(model.set, modnames = model.names)

The above table is ranked from best to worst. We can see that the degree-4 model has the lowest AIC value, with degree 1 performing worst. Deg4 and Deg5 both have the highest log-likelihood, but Deg4 wins out, probably for the reasons mentioned above.
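The same ranking can be reproduced with base R's AIC() function; aictab() reports AICc, a small-sample correction of the same quantity. As a self-contained illustration (toy data with a known cubic ground truth, not the yacht dataset):

```r
set.seed(1)
x <- runif(100)
y <- x^3 + rnorm(100, sd = 0.05)   # toy data: cubic ground truth plus noise
# fit polynomial models of degree 1 through 5 on the toy data
fits <- lapply(1:5, function(d) lm(y ~ poly(x, d, raw = TRUE)))
sapply(fits, AIC)                  # AIC drops sharply up to degree 3, then flattens
```

AIC rewards fit (log-likelihood) but charges 2 per extra parameter, which is why it flattens once the true degree is reached rather than always preferring the most complex model.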

In summary, I believe my initial model underfits the data heavily, given its low R-squared value and the fact that the data is not well suited to a degree-1 linear regression. I also believe that in my model the only significant variable was V6, the Froude number, with all other variables having very little influence.

I would also like to note that in my research I found a paper recommending that the prismatic coefficient (V2) be modelled as a function of the Froude number (https://www.researchgate.net/publication/262486291_A_Practical_Design_Approach_and_RANSE-based_Resistance_Prediction_for_Medium-speed_Catamarans), which is interesting because, as I found in the higher-degree models (e.g. Deg4), the V2 variable was starting to appear more significant.

Q2.6: Use the glmnet library to build two linear regression models with Lasso and Ridge regularization, respectively. In comparison to your model, how well do these two models perform? Do the regularized models automatically filter out the less significant features? What are the differences between these two models? Please justify your answers. (8 marks)

The fitAndPlot function below takes alpha, the elastic-net mixing parameter (0: Ridge, 1: LASSO), and lambda, a range of values for the regularization parameter.

Auxiliary functions
fitAndPlot <- function(train.data, train.label, alpha, lambda){
    
    # fit the model (glmnet() is from the glmnet package: library(glmnet))
    fit <- glmnet(x = as.matrix(train.data), y = train.label, alpha = alpha, lambda = lambda)

    # aggregate the outputs: take the beta matrix from fit, transpose it (t), then append the df and lambda values
    out <- as.data.frame(as.matrix(t(fit$beta)))
    out[,c('nonzero', 'lambda')] <- c(fit$df, fit$lambda)

    # reshape the outputs (for plotting)
    out.m<-melt(out, id=c('lambda', 'nonzero'))
    names(out.m) <- c('lambda', 'nonzero', 'feature', 'coefficient')

    # plot coefficients vs lambda 
    g <- ggplot(data = out.m, aes(x=lambda, y=coefficient, color=factor(feature))) + 
        geom_line() +
        ggtitle('Coefficients vs. lambda') + 
        theme_minimal()
    print(g)
    
    # plot number of nonzero coefficients (as a measure of model complexity) vs lambda 
    g <- ggplot(data = out.m, aes(x=lambda, y=nonzero)) + 
        geom_line() + 
        scale_y_continuous(breaks=seq(0,10,1)) +
        scale_color_discrete(guide = guide_legend(title = NULL)) + 
        ggtitle('Nonzero Coefficients vs. lambda') + 
        theme_minimal()
    print(g)
    
    # run the predictions
    train.predict <- predict(fit, newx=as.matrix(train.data))
    test.predict <- predict(fit, newx=as.matrix(test.data2))

    # calculate the errors
    error <- data.frame('lambda' = out$lambda, 
                    'train' = sqrt(colSums((train.predict - train.label)^2)/nrow(train.predict)),
                    'test' = sqrt(colSums((test.predict - test.target2)^2)/nrow(test.predict)))
    print(tail(error, 1))
    error.m <- melt(error, id='lambda')
    names(error.m) <- c('lambda', 'set', 'RMSE')

    # plot RMSE for train and test sets vs lambda 
    g <- ggplot(data = error.m, aes(x=lambda, y = RMSE, color = factor(set))) + 
        geom_line() +  
        ylim(0,6) +
        scale_color_discrete(guide = guide_legend(title = NULL)) + 
        ggtitle('RMSE vs. lambda') + 
        theme_minimal()
    print(g)
}
LASSO Regularisation
fitAndPlot (train.data2, train.target2, alpha=1, lambda = c(0:1000)/10000)

##       lambda     train      test
## s1000      0 0.1421077 0.1447743

RIDGE Regularisation
fitAndPlot (train.data2, train.target2, alpha=0, lambda = c(0:1000)/10000)

##       lambda     train      test
## s1000      0 0.1421072 0.1447694

Running these two models was much faster than running my gradient descent model, which goes to show that a penalized solver (glmnet uses an efficient coordinate-descent implementation) can converge much more quickly. While I was initially a little worried about the error rate because the models learned so quickly, the RMSEs of both Ridge and LASSO are almost exactly the same as my GD model's, and as each other's.

However, there is a very clear difference between the two final models. LASSO has very quickly identified that most of the variables are not important to the output, and by the end of the run we have only one non-zero coefficient, V6 (the Froude number).

Ridge, by contrast, has kept all 6 variables, shrinking their coefficients towards zero without eliminating any of them.
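This contrast has a clean closed form in the special case of an orthonormal design (a simplification for illustration; glmnet solves the general case by coordinate descent): Ridge scales every OLS coefficient by \(1/(1+\lambda)\), while LASSO soft-thresholds them, sending small coefficients exactly to zero. The coefficient values below are made up for illustration:

```r
beta_ols <- c(V6 = 0.63, V1 = 0.02, V2 = -0.01, V3 = 0.05)  # hypothetical OLS estimates
lambda   <- 0.1
ridge <- beta_ols / (1 + lambda)                             # shrinks all, none exactly zero
lasso <- sign(beta_ols) * pmax(abs(beta_ols) - lambda, 0)    # weak coefficients -> exactly 0
```

With these numbers, only the V6 entry of lasso survives, mirroring the feature-selection behaviour observed above, while every entry of ridge stays non-zero.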

I believe that Ridge ran slightly faster than LASSO here. On large datasets Ridge is often faster in practice, but it keeps all variables, which may not be ideal.

I’d like to run these 3 variations on larger datasets to see how they perform.

Question 3 - Logistic Regression (45 marks)

In this question, you are required to implement a Logistic Regression model to classify whether a person has liver disease or not. Please read the sub-questions below carefully for detailed instructions.

  1. Check out the Indian Liver Patient Dataset (ILPD) at https://archive.ics.uci.edu/ml/datasets/ILPD+%28Indian+Liver+Patient+Dataset%29
  2. Perform data preprocessing to determine and remove invalid samples. Split the data into a training set and a test set with a ratio of 7:3. (2 marks)
  3. Develop a Logistic Regression model that uses batch gradient descent for optimization. Visualize the parameter updating process, test accuracy (ACC) in each iteration, and the cost convergence process. Please note that you need to develop your model step-by-step. Built-in models in any released R package, like glm, are NOT allowed to use in this question. (10 marks)
  4. Investigate the influence of different learning rates on the training process and answer what happened if you apply a too small or a too large learning rate. (5 marks)
  5. Implement and compare the batch gradient descent and the stochastic gradient descent and discuss your findings (e.g., convergence speed). Visualize the comparison in terms of updating process and the cost convergence process. (6 marks)
  6. Develop a K-fold (K = 10) cross-validation to evaluate your model in step 3. Please note that you need to write R codes to explicitly show how you perform the K-fold cross-validation. Built-in validation methods are not allowed to use. Different metrics, e.g., ACC, Recall, precision, etc. should be used to evaluate your model. (8 marks)
  7. Use different values of K (from 5 to N, where N denotes the sample number) and summarize the corresponding changes of your model performances. Visualize and explain the changes. (6 marks)
  8. How can you modify the cost function to prevent overfitting? Discuss the possibility of adding regularization term(s) and summarize the possible changes in the gradient descent process. (8 marks)

Q3.1: Data Exploration

# rm(list = ls(all.names = TRUE))
df3 = read.csv('ILPD.csv')
colnames(df3)
##  [1] "Age"           "TB"            "DB"            "Alkphos"      
##  [5] "Sgpt"          "Sgot"          "TP"            "ALB"          
##  [9] "A.G.Ratio"     "Liver.Patient"

Columns:

  • Age
    Age of the patient

  • TB => Total Bilirubin
    Bilirubin is an orange-yellow pigment that occurs normally when part of your red blood cells break down. A bilirubin test measures the amount of bilirubin in your blood. It’s used to help find the cause of health conditions like jaundice, anemia, and liver disease. Normal results for a total bilirubin test are up to 1.2 milligrams per deciliter (mg/dL) for adults.

  • DB => Direct Bilirubin
    Bilirubin attached by the liver to glucuronic acid, a glucose-derived acid, is called direct, or conjugated, bilirubin. Normal results for direct bilirubin are generally up to 0.3 mg/dL.

  • Alkphos => Alkaline Phosphatase
    Alkaline phosphatase (ALP) is an enzyme in a person’s blood that helps break down proteins. Using an ALP test, it is possible to measure how much of this enzyme is circulating in a person’s blood. The normal range for serum ALP level is 20 to 140 IU/L. High levels may indicate liver damage.

  • Sgpt => Alanine Aminotransferase
    Alanine aminotransferase (ALT) is an enzyme found primarily in the liver and kidney. ALT increases with liver damage and is used to screen for and/or monitor liver disease. A normal range is considered to be between 7 and 56 units per liter of serum.

  • Sgot => Aspartate Aminotransferase
    Aspartate Aminotransferase (AST) is an enzyme that is found mostly in the liver, but also in muscles. When your liver is damaged, it releases AST into your bloodstream. An AST blood test measures the amount of AST in your blood. The test can help your health care provider diagnose liver damage or disease. A normal range is considered to be between 10 and 40 units per liter of serum.

  • TP => Total Proteins
    Albumin and globulin are two types of protein in your body. The total protein test measures the total amount of albumin and globulin in your body. The normal range is between 6 and 8.3 grams per deciliter (g/dL). Low values may indicate liver disease.

  • ALB => Albumin
    Albumin is produced only in the liver, and is the major plasma protein that circulates in the bloodstream. A low serum albumin indicates poor liver function. Normal range is 3.4 to 5.4 g/dL.

  • A.G.Ratio => Albumin and Globulin Ratio
    The albumin/globulin ratio is the amount of albumin in the serum divided by the globulins. The ratio is used to try to identify causes of change in total serum protein. A decrease in albumin without a decrease in globulins (i.e. A.G.Ratio < 1) can indicate severe liver disease. A good ratio is around 1:1.

  • Liver.Patient
    Indicates whether the patient has liver disease or not. 1 indicates disease, 2 indicates healthy.

Q3.2: Perform data preprocessing to determine and remove invalid samples. Split the data into a training set and a test set with a ratio of 7:3. (2 marks)

str(df3)
## 'data.frame':    583 obs. of  10 variables:
##  $ Age          : int  65 62 62 58 72 46 26 29 17 55 ...
##  $ TB           : num  0.7 10.9 7.3 1 3.9 1.8 0.9 0.9 0.9 0.7 ...
##  $ DB           : num  0.1 5.5 4.1 0.4 2 0.7 0.2 0.3 0.3 0.2 ...
##  $ Alkphos      : int  187 699 490 182 195 208 154 202 202 290 ...
##  $ Sgpt         : int  16 64 60 14 27 19 16 14 22 53 ...
##  $ Sgot         : int  18 100 68 20 59 14 12 11 19 58 ...
##  $ TP           : num  6.8 7.5 7 6.8 7.3 7.6 7 6.7 7.4 6.8 ...
##  $ ALB          : num  3.3 3.2 3.3 3.4 2.4 4.4 3.5 3.6 4.1 3.4 ...
##  $ A.G.Ratio    : num  0.9 0.74 0.89 1 0.4 1.3 1 1.1 1.2 1 ...
##  $ Liver.Patient: int  1 1 1 1 1 1 1 1 2 1 ...
colSums(is.na(df3))
##           Age            TB            DB       Alkphos          Sgpt 
##             0             0             0             0             0 
##          Sgot            TP           ALB     A.G.Ratio Liver.Patient 
##             0             0             0             4             0

There are a total of 583 rows in the dataset, and 4 rows in the A.G.Ratio column have null values. The values have been read in as numeric and integer data types, which I will leave as is. I will remove the rows with null values and plot the data in boxplots to check for significant outliers.

# remove null values
df3 <- na.omit(df3)
# confirm 4 rows with null values have been removed
dim(df3)
## [1] 579  10
# corrplot() is from the corrplot package: library(corrplot)
M <- cor(df3)
corrplot(M, method="number")

# str(df3)


# check if any errors in output column
unique(df3[,"Liver.Patient"])
## [1] 1 2
# recode the Liver.Patient label so that 1 = has disease, 0 = no disease (originally coded 2)
df3$Liver.Patient[df3$Liver.Patient == 2] <- 0


summary(df3)
##       Age              TB               DB            Alkphos      
##  Min.   : 4.00   Min.   : 0.400   Min.   : 0.100   Min.   :  63.0  
##  1st Qu.:33.00   1st Qu.: 0.800   1st Qu.: 0.200   1st Qu.: 175.5  
##  Median :45.00   Median : 1.000   Median : 0.300   Median : 208.0  
##  Mean   :44.78   Mean   : 3.315   Mean   : 1.494   Mean   : 291.4  
##  3rd Qu.:58.00   3rd Qu.: 2.600   3rd Qu.: 1.300   3rd Qu.: 298.0  
##  Max.   :90.00   Max.   :75.000   Max.   :19.700   Max.   :2110.0  
##       Sgpt              Sgot              TP             ALB       
##  Min.   :  10.00   Min.   :  10.0   Min.   :2.700   Min.   :0.900  
##  1st Qu.:  23.00   1st Qu.:  25.0   1st Qu.:5.800   1st Qu.:2.600  
##  Median :  35.00   Median :  42.0   Median :6.600   Median :3.100  
##  Mean   :  81.13   Mean   : 110.4   Mean   :6.482   Mean   :3.139  
##  3rd Qu.:  61.00   3rd Qu.:  87.0   3rd Qu.:7.200   3rd Qu.:3.800  
##  Max.   :2000.00   Max.   :4929.0   Max.   :9.600   Max.   :5.500  
##    A.G.Ratio      Liver.Patient  
##  Min.   :0.3000   Min.   :0.000  
##  1st Qu.:0.7000   1st Qu.:0.000  
##  Median :0.9300   Median :1.000  
##  Mean   :0.9471   Mean   :0.715  
##  3rd Qu.:1.1000   3rd Qu.:1.000  
##  Max.   :2.8000   Max.   :1.000

We can see correlations between TB and DB which makes sense as DB contributes to TB, Sgot and Sgpt are also strongly related, as is ALB and TP and A.G.Ratio and ALB - again this makes sense since ALB contributes to both TP and A.G.Ratio.

Next split data into test and train sets

## [1] 405   9
## [1] 174   9
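The split chunk itself is not echoed above; a minimal sketch of how it could look (the seed is an assumption; caret's createDataPartition would also work and additionally stratifies on the class label):

```r
set.seed(42)                                   # assumed seed for reproducibility
n <- 579                                       # rows remaining after na.omit
train.idx <- sample(n, size = round(0.7 * n))  # 70% of row indices for training
c(length(train.idx), n - length(train.idx))    # 405 training rows, 174 test rows
# train set: df3[train.idx, ]; test set: df3[-train.idx, ]
```

Sampling indices without replacement, rather than taking the first 70% of rows, avoids any ordering bias in the file.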

Q3.3: Develop a Logistic Regression model that uses batch gradient descent for optimization. Visualize the parameter updating process, test accuracy (ACC) in each iteration, and the cost convergence process. Please note that you need to develop your model step-by-step. Built-in models in any released R package, like glm, are NOT allowed to use in this question. (10 marks)

Building the logistic regression requires the following steps:

  • Implement the sigmoid function \(\sigma ( w \cdot x)\) and initialize the weight vector w, learning rate \(\eta\) and stopping criterion \(\epsilon\)
  • Repeat the following until the improvement becomes negligible:
    • For each datapoint in the training data do:
      \(w^{(\tau+1)} := w^{(\tau)} - \eta(\sigma(w \cdot x_n) - t_n)x_n\)
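A minimal numeric sketch of one such update step, with toy values (the vectorized implementation actually used for the model follows below):

```r
sigmoid <- function(w, x) 1 / (1 + exp(-sum(w * c(1, x))))   # p(C1|x), with a bias term
w   <- c(0, 0, 0)              # initial weights (bias, w1, w2)
eta <- 0.1                     # learning rate
x_n <- c(0.5, -1.2); t_n <- 1  # one toy datapoint and its label
w <- w - eta * (sigmoid(w, x_n) - t_n) * c(1, x_n)  # one update step
```

With zero initial weights the sigmoid outputs 0.5, so the first step moves each weight by \(\eta \cdot 0.5\) times the corresponding input component, in the direction of the true label.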
Auxiliary Functions
c0 <- 0
c1 <- 1

# auxiliary function that calculates the cost
cost <- function (w, X, T, c0){
    sig <- sigmoid(w, X)
    return(sum(ifelse(T==c0, 1-sig, sig)))
}


# auxiliary function that predicts class labels
pred <- function(w, X, c0, c1){
    sig <- sigmoid(w, X)
    return(ifelse(sig>0.5, c1,c0))
}


# Sigmoid function (=p(C1|X))
sigmoid <- function(w, x){
    return(1.0/(1.0+exp(-w%*%t(cbind(1,x)))))    
}


# Accuracy function
get_accuracy <- function(w) {
  # predict targets with current model
  preds <- t(as.matrix(pred(w,train.data3,c0,c1)))
  
  # create confusion matrix, converting targets and predictions to factors
  tab <- table(factor(preds, levels=min(train.target3):max(train.target3)), 
      factor(train.target3, levels=min(train.target3):max(train.target3)))
  
  # calculate % accuracy = (TP + TN) / (TP + TN + FP + FN)
  acc <- sum(diag(tab)/(sum(rowSums(tab)))) * 100
  
  return(acc)
}


plot_coeffs <- function(W, model) {
  # convert W matrix to a dataframe and  melt for plotting
  W.df <- as.data.frame(W); names(W.df) <- c('w0','w1','w2','w3','w4', 'w5', 'w6', 'w7', 'w8', 'w9')
  W.df$tau <- 1:nrow(W.df)

  W.m <-melt(W.df, id='tau');
  names(W.m) <- c('tau', 'coefficients', 'values')

  ggplot(data=W.m, aes(x=tau, y=values, color=coefficients)) + geom_line() + 
    labs(title='Estimated Coefficients', subtitle=model) +
    theme_minimal()
}


plot_costs <- function(costs, eta) {
  ggplot(data=costs, aes(x=tau, y=log(cost))) +
    geom_line() + labs(title='Log of Cost over Time', subtitle=paste("Learning rate:", eta)) + theme_minimal()
}


plot_acc <- function(acc, eta) {
  ggplot(data=acc, aes(x=tau, y=accuracy)) +
    geom_line() + labs(title='% Accuracy over Time', subtitle=paste("Learning rate:", eta)) + theme_minimal()
}
executeLogReg Function
executeLogReg <- function(data, target, tau.max, eta, epsilon, algorithm) {
  # Initializations
  tau <- 1 # iteration counter
  terminate <- FALSE
  
  ## Just a few name/type conversions to make the rest of the code easy to follow
  X <- as.matrix(data) # rename just for convenience
  T <- ifelse(target==c0,0,1) # convert target labels to 0/1
  
  W <- matrix(,nrow=tau.max, ncol=(ncol(X)+1)) # to be used to store the estimated coefficients

  set.seed(ncol(W))
  W[1,] <- runif(ncol(W)) # initial weight (any better idea?)
  # W[1,] <- 0
  
  # project data using the sigmoid function (just for convenience)
  Y <- sigmoid(W[1,],X)
  
  costs <- data.frame('tau'=1:tau.max)  # to be used to trace the cost in each iteration
  costs[1, 'cost'] <- cost(W[1,],X,T, c0)
  
  model_acc <- data.frame('tau'=1:tau.max)  # to be used to trace the accuracy in each iteration
  model_acc[1, 'accuracy'] <- get_accuracy(W[1,])
  
  while(!terminate){
    # check termination criteria:
    if (tau >= tau.max | cost(W[tau,],X,T, c0) <= epsilon) {terminate <- TRUE; break}

    if (algorithm == "bgd") {
      
      # Get gradient
      Y <- sigmoid(W[tau,],X)
      
      # Update the weights
      W[(tau+1),] <- W[tau,] - eta * 1/nrow(X) * (Y-T) %*% cbind(1, X)
      
      # record the cost:
      costs[(tau+1), 'cost'] <- cost(W[tau,],X,T, c0)
      
      # record the accuracy
      model_acc[(tau+1), 'accuracy'] <- get_accuracy(W[tau,])
      
      # update the counter:
      tau <- tau + 1
      
      # decrease learning rate:
      eta = eta * 0.999

    }
    
    if (algorithm == "sgd") {

      # shuffle data:
      train.index <- sample(1:train_len, train_len, replace = FALSE)
      X <- X[train.index,]
      T <- T[train.index]

      # for each datapoint:
      for (i in 1:train_len) {
        # check termination criteria:
        if (tau >= tau.max | cost(W[tau,],X,T, c0) <=epsilon) {terminate<-TRUE;break}

        Y <- sigmoid(W[tau,],X)

        # Update the weights
        W[(tau+1),] <- W[tau,] - eta * (Y[i]-T[i]) * cbind(1, t(X[i,]))

        # record the cost:
        costs[(tau+1), 'cost'] <- cost(W[tau,],X,T, c0)

        # record the accuracy
        model_acc[(tau+1), 'accuracy'] <- get_accuracy(W[tau,])

        # update the counter:
        tau <- tau + 1

        # decrease learning rate:
        eta = eta * 0.999
      }
    }

  }
  costs <- costs[1:tau, ] # remove the NA tail of the data frame (in case of early stopping)
  model_acc <- model_acc[1:tau, ]
  W <- W[1:tau,]
  f_coeffs <- W[tau,]
  f_acc <- tail(model_acc, 1)
  
  result <- list("costs" = costs, "acc" = model_acc,  "W" = W, "f_coeffs" = f_coeffs, "f_acc" = f_acc)
  
  return(result)
}
bgd.1 <- executeLogReg(train.data3, train.target3, 5000, 0.1, 0.01, "bgd")
print(paste("The final accuracy is:",round(bgd.1$f_acc[2], digits=2),"%"))
## [1] "The final accuracy is: 69.88 %"
cat("The final coefficients are:", bgd.1$f_coeffs)
## The final coefficients are: -1.685641 1.15582 13.25407 8.140503 0.1855517 0.7293219 -0.1552714 -13.89243 -8.953661 -2.72975
Visualise cost over time
plot_costs(bgd.1$costs, 0.1)

The cost of the model drops quite quickly at the beginning, but then smooths out around the 2000th iteration, after which there is no significant improvement in the cost. The model could probably stop after the 2500th iteration.

Visualise Coefficients over Time
plot_coeffs(bgd.1$W, "BGD, Learning rate: 0.1")

The model has spent a significant amount of time estimating the coefficients, and it appears that it is still modifying them after the 5000th iteration. The values jump around a fair bit at the beginning, but this is due to the initial random values that the coefficients were assigned.

Looking at the final accuracy on the training data, the model appears to have learned the data fairly quickly, but its performance is only average.

Visualise % Accuracy over Time
plot_acc(bgd.1$acc, 0.1)

The accuracy graph is interesting because it jumps around so much, and its behaviour closely tracks the cost graph, as can be seen below:

grid.arrange(
  plot_acc(bgd.1$acc, 0.1),
  plot_costs(bgd.1$costs, 0.1)
)

It is interesting to see how the accuracy starts off so low, most likely due to the random initialisation of W, and then continues to jump around while the model learns. It is also notable that the lower the accuracy, the lower the cost.

It is also worth noting that even though the model is still changing the coefficient values at the 5000th iteration, there are no further cost or accuracy improvements, which is why I stopped it there.

Evaluate model on test data

I will look at accuracy, sensitivity and specificity.
In medical settings, sensitivity and specificity are the two most reported ratios from the confusion matrix.

Sensitivity (AKA Recall): true positive rate (true positive)/(true positive+false negative). This describes what proportion of patients with liver disease are correctly identified as having liver disease. If high, we aren’t missing many people with liver disease. If low, then we are missing these people and they won’t receive the treatment they should.

Specificity: true negative rate (true negative)/(true negative+false positive). This describes what proportion of healthy patients are correctly identified as healthy. If high, we are correctly marking healthy patients as healthy. If low, we have false positives, and people will either incorrectly receive treatment or in some other way respond to a false positive.
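These two definitions can be sanity-checked on a toy confusion matrix (the counts below are made up for illustration, not taken from our model):

```r
# Toy counts (hypothetical): 60 patients with disease, 40 healthy
TP <- 40  # diseased, predicted diseased
FN <- 20  # diseased, predicted healthy (missed)
TN <- 30  # healthy, predicted healthy
FP <- 10  # healthy, predicted diseased

sensitivity <- TP / (TP + FN)  # proportion of diseased patients we catch
specificity <- TN / (TN + FP)  # proportion of healthy patients we clear

round(sensitivity, 2)  # 0.67
round(specificity, 2)  # 0.75
```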

# auxiliary function that predicts class labels
prob <- function(sigval){
    # sigval <- sig(x)
    return(ifelse(sigval>0.5, 1, 0))
}

# Sigmoid function (=p(C1|X))
sig <- function(x){
    return(1.0/(1.0+exp(-x)))    
}

evalRegModel <- function(data, target, coeffs) {
  model_results <- setNames(data.frame(matrix(ncol = 3)), c("actual", "score", "predicted"))
  
  size <- length(target)
  
  for (i in 1:size) {
    row <- as.matrix(cbind('v0'=1, data[i,]))
    score <- sig(sum(data.frame(mapply(`*`, row, coeffs))))
    yhat <- prob(score)
    y <- target[i]
    model_results <- rbind(model_results, c(y, score, yhat))
    
  }
  # print(model_results)
  model_results <- na.omit(model_results)
    # create confusion matrix, converting targets and predictions to factors
  tab <- table(factor(model_results$predicted, levels=min(test.target3):max(test.target3)), 
      factor(model_results$actual, levels=min(test.target3):max(test.target3)))
  # print(tab)
  # calculate % accuracy = (TP + TN) / (TP + TN + FP + FN)
  acc <- sum(diag(tab)/(sum(rowSums(tab)))) * 100
  # sensitivity AKA recall
  sen <- round(tab[1,1]/sum(tab[1,]), digits=2)
  spec <- round(tab[2,2]/sum(tab[2,]), digits=2)
  prec <- round(tab[1,1]/sum(tab[,1]), digits=2)
  fscore <- round(2*((prec * sen)/(prec + sen)), digits=2)
  
  result <- list("acc" = acc, "sen" = sen,  "spec" = spec, "prec" = prec, "fscore" = fscore)
  
  return(result)
}

test_acc <- evalRegModel(test.data3, test.target3, bgd.1$f_coeffs)

print(paste("The model has a", round(test_acc$acc, digits=2), "% accuracy rate on the test data"))
## [1] "The model has a 63.79 % accuracy rate on the test data"
print(paste("Sensitivity:", test_acc$sen))
## [1] "Sensitivity: 0.34"
print(paste("Specificity:", test_acc$spec))
## [1] "Specificity: 0.8"
print(paste("Precision:", test_acc$prec))
## [1] "Precision: 0.49"
print(paste("Fscore:", test_acc$fscore))
## [1] "Fscore: 0.4"

Looking at the model’s performance on the test data set, we can see that it has \(\approx 64 \%\) accuracy in correctly predicting whether a patient has liver disease or not, which is not that great.

We have a low sensitivity of \(0.34\), meaning our model is missing many patients with liver disease, so these patients would not be identified for treatment.

Our specificity is a much better value at \(0.80\), indicating that our model is much better at identifying patients who are healthy.

In a real medical setting, it would be important to have both high sensitivity AND high specificity, so I would say this model performs quite poorly. This is probably due to the size of the dataset, and the fact that we have used all of the available variables, whereas ideally we should use only the variables that best predict disease.

Q3.4: Investigate the influence of different learning rates on the training process and answer what happened if you apply a too small or a too large learning rate. (5 marks)

bgd10 <- executeLogReg(train.data3, train.target3, 5000, 10, 0.01, "bgd")
bgd1.0 <- executeLogReg(train.data3, train.target3, 5000, 1.0, 0.01, "bgd")
bgd.01 <- executeLogReg(train.data3, train.target3, 5000, 0.01, 0.01, "bgd")
bgd.001 <- executeLogReg(train.data3, train.target3, 5000, 0.001, 0.01, "bgd")
bgd.0001 <- executeLogReg(train.data3, train.target3, 5000, 0.0001, 0.01, "bgd")
c1 <- plot_costs(bgd10$costs, 10)
c2 <- plot_costs(bgd1.0$costs, 1.0)
c3 <- plot_costs(bgd.1$costs, 0.1)
c4 <- plot_costs(bgd.01$costs, 0.01)
c5 <- plot_costs(bgd.001$costs, 0.001)
c6 <- plot_costs(bgd.0001$costs, 0.0001)

grid.arrange(
  c1, c2, c3, c4, c5, c6,
  ncol=2
)

a1 <- plot_acc(bgd10$acc, 10)
a2 <- plot_acc(bgd1.0$acc, 1.0)
a3 <- plot_acc(bgd.1$acc, 0.1)
a4 <- plot_acc(bgd.01$acc, 0.01)
a5 <- plot_acc(bgd.001$acc, 0.001)
a6 <- plot_acc(bgd.0001$acc, 0.0001)

grid.arrange(
  a1, a2, a3, a4, a5, a6,
  ncol=2
)

By changing the learning rate, we can see that when the learning rate is high, our accuracy tends to be quite high; however, the cost jumps around considerably and the model takes longer to plateau.

  • \(\eta = 10\) We can see the cost starts to flatten out around the 2700th iteration, but it is still learning as we can see the line is jumping around a lot (it is still quite thick). A model with this learning rate would take many more iterations before it could confidently converge. We can also see that the accuracy of this model is still fluctuating, but it appears to be around 70%.

  • \(\eta = 1\) The model starts to flatten out around the 2500th iteration, but there is still some bouncing around in costs, until it gets to the 4000th iteration when it starts to smooth. The accuracy also starts to smooth at the 4500th iteration and is around the 70% mark.

  • \(\eta = 0.1\) This is the model that I have decided to go with in my previous answer. The cost begins to smooth much earlier than the previous rates at around the 2500th iteration and appears to be at the same sort of level of \(\approx 5.65\). The accuracy smooths out at the 2800th iteration, and is around the 70% mark

  • \(\eta = 0.01\) Similar performance to 0.1; the cost and the accuracy both smooth out slightly later, but otherwise the two perform quite similarly

  • \(\eta = 0.001\) We start to see improvements in the cost - the model now flattens out to a cost of \(5.5\) at around the 2200th iteration, however the accuracy drops down to around 65%.

  • \(\eta = 0.0001\) Cost is around the same as the previous model at \(5.56\), but the graph is smooth from the beginning and plateaus out very quickly. However accuracy drops again down to 64%.

We can see that changing the learning rate greatly affects both the cost and the accuracy, and the model needs many iterations before the cost flattens out. A large learning rate tends to require more iterations to settle but reaches fairly high accuracy; a small learning rate achieves a lower cost but also a lower accuracy.
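The effect can be isolated on a toy one-dimensional problem (nothing to do with our liver data): gradient descent on \(f(w)=w^2\), whose gradient is \(2w\). The update \(w \leftarrow w - \eta \cdot 2w\) multiplies \(w\) by \((1-2\eta)\) each step, so a too-large \(\eta\) overshoots and diverges, while a tiny \(\eta\) barely moves:

```r
# Minimal 1-D sketch: gradient descent on f(w) = w^2, starting from w = 1
run_gd <- function(eta, steps = 20) {
  w <- 1
  for (i in 1:steps) {
    w <- w - eta * 2 * w  # gradient of w^2 is 2w
  }
  w
}

abs(run_gd(eta = 1.1))     # > 1: too large, the iterates oscillate and diverge
abs(run_gd(eta = 0.4))     # ~ 0: moderate, converges quickly to the minimum at 0
abs(run_gd(eta = 0.0001))  # ~ 1: too small, barely moved after 20 steps
```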

Q3.5: Implement and compare the batch gradient descent and the stochastic gradient descent and discuss your findings (e.g., convergence speed). Visualize the comparison in terms of updating process and the cost convergence process. (6 marks)

sgd.1 <- executeLogReg(train.data3, train.target3, 5000, 0.1, 0.01, "sgd")
df_compare <- setNames(data.frame(matrix(ncol = 2)), c("SGD", "BGD"))
df_compare[1,1] <- round(sgd.1$f_acc[2], digits=2)
df_compare[1,2] <- round(bgd.1$f_acc[2], digits=2)
row.names(df_compare) <- "% Accuracy"

print(df_compare)
##              SGD   BGD
## % Accuracy 49.66 69.88
Plot Coefficient Estimation
csgd <- plot_coeffs(sgd.1$W, "SGD")
cbgd <- plot_coeffs(bgd.1$W, "BGD")

grid.arrange(
  csgd, cbgd  
)

Plot Cost over time
costsgd <- plot_costs(sgd.1$costs, 0.1)
costbgd <- plot_costs(bgd.1$costs, 0.1)

grid.arrange(
  costsgd, costbgd  
)

The final accuracy of SGD (\(\approx 50\%\)) is noticeably lower than BGD's (\(\approx 70\%\)), and the SGD costs are much higher and don't smooth out the way they do in BGD. The coefficient-estimation graph is also interesting: the SGD estimates jumped around far more than BGD (at one point it estimated w4 to be over 100!) and took longer to converge, which may be due to the size of the dataset. SGD typically shines on very large datasets, so perhaps our training set of 405 rows is not large enough to show its benefits. The higher, noisier cost of SGD is expected given the random nature of its updates; it appears to be shrinking as it goes through more iterations and would probably flatten out with further training.

Q3.6: Develop a K-fold (K = 10) cross-validation to evaluate your model in step 3. Please note that you need to write R codes to explicitly show how you perform the K-fold cross-validation. Built-in validation methods are not allowed to use. Different metrics, e.g., ACC, Recall, precision, etc. should be used to evaluate your model. (8 marks)

Please note that I have two methods of training the models in kfoldcv: using glm(), and using my own batch gradient descent (BGD) algorithm as defined above. Running the BGD algorithm takes my computer up to an hour, so I have done my analysis on the glm outputs instead; however, I have left the BGD code in the function. To run it on BGD, please uncomment the three lines marked as running the gradient descent function from Q3.3, and comment out the section between the "Using glm" markers.

kfoldcv <- function(k, data) {
  # shuffle data
  set.seed(42)
  rows <- sample(nrow(data))
  shuff_data <- data[rows, ]
  
  N <- nrow(data) # use the function argument rather than the global df3

  Min_index <- 0
  folds <- list()

  # create folds
  for(i in 1:k) {
    
      if(i==k) {
          Max_index <- (floor(N/k))*i+(N%%k) # For last fold make the size as the proportion N/k plus the remainder
      }
      else {
          Max_index <- (floor(N/k))*i    
      }
    
      folds[i] <- list(shuff_data[(Min_index+1):Max_index,]) # Append the fold to the list
      Min_index <- Max_index # Change the starting point for next loop
  }
  
  Kresults <- setNames(data.frame(matrix(ncol = 4)), c("accuracy", "precision", "recall", "fscore")) # Store the results
  
  for(j in 1:k) { # For each fold
    test <- data.frame(folds[j]) # get one test chunk
    train <- list() 
    for(i in 1:k) { 
      if(i!=j) { # Avoid binding the current test fold 
          train <- rbind(train, data.frame(folds[i])) # attach other folds to be train set           
      }
    }

    # the below two lines can be used to run the gradient descent function defined in Q3.3
    # fit <- executeLogReg(train[,-10], train[,10], 1000, 0.1, 0.1,"bgd")
    # preds <- evalRegModel(test[,-10], test[,10], fit$f_coeffs)
    # Kresults <- rbind(Kresults, c(preds$acc, preds$prec, preds$sen, preds$fscore))

    #  *************************** Using glm ***************************
    fit <- glm(Liver.Patient ~ ., data=train, family=binomial(link='logit'))

    # predict the targets
    probs <- data.frame(predict(fit, test, type="response"))
    preds <- ifelse(probs > 0.5,1,0)

    output <- cbind(test, preds)

    # create confusion matrix, converting targets and predictions to factors
    # (the class labels are 0/1, so use those as the factor levels)
    tab <- table(factor(output[,11], levels=0:1),
      factor(output[,10], levels=0:1))

    # calculate % accuracy = (TP + TN) / (TP + TN + FP + FN)
    acc <- sum(diag(tab)/(sum(rowSums(tab)))) * 100
    # sensitivity AKA recall
    sen <- round(tab[1,1]/sum(tab[1,]), digits=2)
    # precision
    prec <- round(tab[1,1]/sum(tab[,1]), digits=2)
    # f-score
    fscore <- round(2*((prec * sen)/(prec + sen)), digits=2)

    Kresults <- rbind(Kresults, c(acc, prec, sen, fscore))
    #  *************************** to here  ***************************
  }
  Kresults <- na.omit(Kresults)
  Kresults <- rbind(Kresults, c(mean(Kresults$accuracy),
                                mean(Kresults$precision),
                                mean(Kresults$recall),
                                mean(Kresults$fscore)))
  Kresults <- na.omit(Kresults)
  final <- tail(Kresults,n=1)

  res <- list("Kresults" = Kresults, "final" = final)
  
  return(res)
}
k10 <- kfoldcv(10, df3)
print(k10$final)
##     accuracy precision recall fscore
## 111 71.43541     0.275  0.514  0.333

We can see that running 10-fold cross-validation gives an accuracy slightly higher than our initial evaluation in Q3.3. However, the recall and precision are much lower, and the F-score suggests the model is quite underfitted. This is interesting, because I expected the model to perform much better on precision and recall.

Q3.7: Use different values of K (from 5 to N, where N denotes the sample number) and summarize the corresponding changes of your model performances. Visualize and explain the changes. (6 marks)

kAll <- setNames(data.frame(matrix(ncol = 5)), c("K-value", "accuracy", "precision", "recall", "fscore")) # Store the results

k5 <- kfoldcv(5, df3)
row <- cbind("K-value"=5, k5$final)
kAll <- rbind(kAll, row)

row <- cbind("K-value"=10, k10$final)
kAll <- rbind(kAll, row)

k25 <- kfoldcv(25, df3)
row <- cbind("K-value"=25, k25$final)
kAll <- rbind(kAll, row)

k50 <- kfoldcv(50, df3)
row <- cbind("K-value"=50, k50$final)
kAll <- rbind(kAll, row)

k100 <- kfoldcv(100, df3)
row <- cbind("K-value"=100, k100$final)
kAll <- rbind(kAll, row)

k250 <- kfoldcv(250, df3)
row <- cbind("K-value"=250, k250$final)
kAll <- rbind(kAll, row)

k450 <- kfoldcv(450, df3)
row <- cbind("K-value"=450, k450$final)
kAll <- rbind(kAll, row)

kN <- kfoldcv(nrow(df3), df3)
row <- cbind("K-value"=nrow(df3), kN$final)
kAll <- rbind(kAll, row)

kAll <- na.omit(kAll)
kAll
ga <- ggplot(kAll,aes(x=`K-value`,y=accuracy)) + 
        geom_line() +
        geom_point(shape=21, color="black", fill="blue", size=3) +
        ggtitle("K-value vs. Accuracy")

gb <- ggplot(kAll,aes(x=`K-value`,y=precision)) + 
        geom_line() +
        geom_point(shape=21, color="black", fill="pink", size=3) +
        ggtitle("K-value vs. Precision")

gc <- ggplot(kAll,aes(x=`K-value`,y=recall)) + 
        geom_line() +
        geom_point(shape=21, color="black", fill="purple", size=3) +
        ggtitle("K-value vs. Recall")

gd <- ggplot(kAll,aes(x=`K-value`,y=fscore)) + 
        geom_line() +
        geom_point(shape=21, color="black", fill="orange", size=3) +
        ggtitle("K-value vs. F-score")

grid.arrange(ga, gb, gc, gd)

  • Precision: measure of the correctly identified positive cases from all the predicted positive cases
  • Recall: measure of the correctly identified positive cases from all the actual positive cases
  • Accuracy: measure of all the correctly identified cases
  • F1-score: harmonic mean of Precision and Recall; gives a better measure of the incorrectly classified cases than Accuracy
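Because the F1-score is a harmonic mean, it is pulled towards the weaker of the two ratios, which is why it can look much worse than accuracy on imbalanced data. A quick illustration with made-up values:

```r
prec <- 0.5  # hypothetical precision
rec  <- 1.0  # hypothetical recall

f1 <- 2 * (prec * rec) / (prec + rec)

f1                # 0.667: dragged towards the weaker ratio (precision)
(prec + rec) / 2  # 0.75: the arithmetic mean would paint a rosier picture
```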

We can see that as K increases, our model's performance also increases, up to the point where, at K=N, we have 100% accuracy. This is expected, since each fold's model is trained on a larger share of the data, but I am surprised to see such a difference between the F-score and the accuracy.

Accuracy can be used when the class distribution is similar while F1-score is a better metric when there are imbalanced classes. Considering there were many more positive cases in our data, than negative, it would make sense that F-score is a better indicator for model accuracy, and so it appears that our model doesn’t really have a good F-score until we get to K=250 onwards. So even though accuracy appears high especially at small k-values, I don’t think the model actually performs that well until it gets to K=250, when we see a big improvement in the F-score. Of course, K=450 performs even better and is practically perfect.

There is also a marked improvement in the precision and recall as we go up in K. For precision, the model is quite imprecise up to K=100. This could be due to our dataset having many more positive points (1s), than negative points (0s) hence the model had more data to learn from. Precision improves drastically at K=250 - jumping from 0.58 to 0.86.

It is interesting to note how well the model’s recall is at lower K values. We can see at K=100 the recall is already very high at 0.86. Again I think this is due to the fact that there are many more positive cases in the dataset than there are negative, but it shows just how much that affects the recall.

Q3.8: How can you modify the cost function to prevent overfitting? Discuss the possibility of adding regularization term(s) and summarize the possible changes in the gradient descent process. (8 marks)

In the gradient descent algorithms, we can apply regularisation terms to the model's weights. This adds a penalty to the loss function for large weights (or parameter values), forcing a simpler model that learns only the relevant patterns in the training data.

There are two common regularisation techniques: L1 and L2. L1 regularisation adds a cost proportional to the absolute value of the parameters, driving some of the weights to exactly zero.

L2 regularisation adds a cost proportional to the squared value of the parameters, resulting in smaller weights; larger coefficients are penalised more strongly because of the squaring.

L2 is better suited to data that is complex and difficult to model accurately, whereas L1 is better for data that is easily modelled. L1 is also robust to outliers and can be used for feature selection, but is slower to converge than L2.

What this would mean for our changes in the gradient descent algorithm is that there is an added parameter in our cost function when we update our weights.

If we added an L2 regularisation term to our cost function, the variance would be reduced, i.e. the coefficients would shrink in magnitude.

If we added an L1 regularisation term, it would also reduce the variance and would act as a feature selector, which could result in a different number of non-zero coefficients in our calculations.

Assuming our cost function is \(f(x)\), we are trying to compute \(\min_x f(x)\). With regularisation, the objective becomes \(\min_x f(x) + \lambda g(x)\), where \(\lambda\) is a hyperparameter that controls the relative importance of the regulariser \(g(x)\).
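As a sketch of the change to the gradient descent process (using an assumed column-vector convention, not the exact matrix shapes from my Q3.3 code), adding an L2 term \(\frac{\lambda}{2}\lVert w\rVert^2\) to the cost changes the batch update only by adding \(\lambda w\) to the gradient; the intercept \(w_0\) is conventionally left unpenalised:

```r
# Hedged sketch of one L2-regularised batch gradient descent step.
# X: n x p feature matrix, T: 0/1 targets, w: length p+1 weights (incl. intercept),
# eta: learning rate, lambda: regularisation strength (a new hyperparameter).
l2_update <- function(w, X, T, eta, lambda) {
  Z <- cbind(1, X)                    # prepend intercept column
  Y <- 1 / (1 + exp(-(Z %*% w)))      # sigmoid predictions
  grad <- t(Z) %*% (Y - T) / nrow(X)  # unregularised gradient, as in Q3.3
  penalty <- lambda * c(0, w[-1])     # L2 term; do not penalise w0
  as.vector(w - eta * (grad + penalty))
}
```

With lambda = 0 this reduces to the plain BGD step; a larger lambda shrinks the non-intercept weights harder on every iteration.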

An interesting point I came across in my research is that the optimal weights are not always driven to 0. Although regularisation pulls the weights towards 0, whether any of them actually reach exactly 0 depends on the regulariser used.

An L2-regularised cost function is smooth, so its optimum is at a stationary point (where the derivative = 0). As \(\lambda\) is increased, the minimising weights can get very small, but they will not equal 0.

L1-regularized cost function is non-smooth, meaning it may not be differentiable at 0. Again, we assume the optimum of a function is either at a point where the derivative = 0, but in a non-smooth function (such as \(y=|x|\)), this would be at corner (an irregularity) so it’s possible that the optimal point of \(f(x)\) is 0 even if 0 isn’t the stationary point of f.